{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# week2 Assignment\n",
    "\n",
    "## 邮件结构\n",
    "\n",
    "- 将邮件的存储格式分为以下几部分\n",
    "   - 'Metadata': 这部分包含从MIME头提取的信息, 如'From', 'Receive', 'Send_time', 'Subject'等等\n",
    "   - 'Content': 这部分会包含所有的内容信息, 如'Body_text', 'Body_html', 'Recite', 'Attachment', 'Signature'等等\n",
    "   - 'Entities': 包含邮件中提到的各种实体, 如'Name', 'Organization', 'Time', 'Position', 'Tel'\n",
    "   - 'Relation': 包含邮件内的各种关系, 如邮件之间的关系, 邮件内容的语义关系.\n",
    "\n",
    "## 思路\n",
    "\n",
    "- 利用flanker提取MIME头, 将信息初步提取, 这是可以完成'Metadata'部分信息的提取\n",
    "- 利用Regex将邮件内容分解, 段落, 引用, 附件, 签名档一次提取出来.\n",
    "- 利用NLTK, jieba等分词工具, 进一步细化, 提取各个实体.\n",
    "- 最后进行更深入的关系分析提取.\n",
    "- 这样层层递进, 逐渐深入,\n",
    "\n",
    "\n",
    "## 难点\n",
    "\n",
    "- 分段有可能比较混乱, 这里可能会花一点时间\n",
    "    - 分段还未做\n",
    "- 引用通过'>','>>'来判断\n",
    "    - 引用也未做\n",
    "- 签名档因为比较复杂, 格式不一, 甚至有的没有, 有的特别简单, 信息不够全面\n",
    "    - 能够粗略的提取签名档, 但格式并未统一\n",
    "- 关系表示, 邮件内部, 邮件外部\n",
    "    - 不知道该怎么表示关系.\n",
    "\n",
    "## Tips\n",
    "\n",
    "- 邮件的结尾都是--boundary--\n",
    "    - 此处有坑: 写正则的时候boundary的字符串结尾有'\\_', 还跟html里'!--'有冲突, 总之在用boundary分离邮件时花费了很多时间.\n",
    "    - 反而不如直接用'From .....@....'去分离.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import json\n",
    "import flanker\n",
    "import jieba.posseg as pseg\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "from flanker import mime\n",
    "from nltk.tag import StanfordNERTagger"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Add models of NLTK  \n",
    "os.environ[\"CLASSPATH\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09\"  \n",
    "os.environ[\"STANFORD_MODELS\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09/models\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Add Tagger\n",
    "st = StanfordNERTagger('/Users/xpgeng/Github/kg-beijing/class1/week1/homework/xpgeng/english.all.3class.distsim.crf.ser.gz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def prepare_data(filename='2013-11.mbx'):\n",
    "    with open(filename, 'r') as f:\n",
    "        data = f.read()\n",
    "    f.close()\n",
    "    data_list = filter(None, re.split(r'From\\s([\\w+.?]+@(\\w+\\.)+(\\w+))', data))  #  \n",
    "    # Here I have to add twice for-loop, I haven't analyse the reason\n",
    "    for data in data_list:\n",
    "        if len(str(data)) < 500:\n",
    "            data_list.remove(data)\n",
    "    for data in data_list:\n",
    "        if len(str(data)) < 500:\n",
    "            data_list.remove(data)\n",
    "    return data_list  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data_list = prepare_data('2013-11.mbx')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提取MIME头信息, 邮件内容, 签名档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_headers(msg_string):\n",
    "    mime_dict = {}\n",
    "    msg = mime.from_string(msg_string)\n",
    "    msg_list = msg.headers.items()\n",
    "    mime_keys = ['From', 'Date', 'Cc', 'To', 'Subject', ]\n",
    "    for item in msg_list:\n",
    "        if item[0] in mime_keys:\n",
    "            mime_dict[item[0]] = item[1]\n",
    "    return mime_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Cc': u'\\u4e2d\\u6587HTML5\\u540c\\u6a02\\u6703ML <public-html-ig-zh@w3.org>',\n",
       " 'Date': u'Wed, 6 Nov 2013 20:09:11 +0800',\n",
       " 'From': u'\\u8463\\u798f\\u8208 Bobby Tung <bobbytung@wanderer.tw>',\n",
       " 'Subject': u'Re: \\u95dc\\u65bc<cite>\\u5143\\u7d20\\u6700\\u8fd1\\u7684\\u5b9a\\u7fa9\\u4fee\\u6539',\n",
       " 'To': u'Yijun Chen <ethantw@me.com>'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_headers(data_list[3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_name_list(data_list):\n",
    "    name_list = []\n",
    "    p = re.compile(ur'\\\"?([\\w\\s\\(\\)]+|[\\x80-\\xff]+)\\\"?\\s<')\n",
    "    for message_string in data_list:\n",
    "        msg = mime.from_string(message_string)\n",
    "        for item in msg.headers.items():\n",
    "            if item[0] == 'From':\n",
    "                name = p.search(item[1].encode('utf-8')).group(1)\n",
    "                name_list.append(name)\n",
    "    name_list = list(set(name_list))\n",
    "    name_list += ['Cindy', 'Kenny', 'Chen Yijun', 'Chunming', '-ambrose']\n",
    "    name_list.remove('com')\n",
    "    name_list.remove(' Chunming')\n",
    "    name_list.remove(' Bobby Tung')\n",
    "    name_list.remove('Hawkeyes Wind')\n",
    "    return name_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_signature(message_string, name_list):\n",
    "    signature_list = []\n",
    "    for name in name_list:\n",
    "        p_name = re.compile(r'^%s.+'%name, re.MULTILINE | re.DOTALL)\n",
    "        msg = mime.from_string(message_string)\n",
    "        for part in msg.parts:\n",
    "            if not isinstance(part.body, (type(None), str)):\n",
    "                if p_name.findall(part.body.encode('utf-8')):\n",
    "                    signature_list += p_name.findall(part.body.encode('utf-8'))\n",
    "    signature = None\n",
    "    for item in signature_list:\n",
    "        if len(item) < 300: \n",
    "            signature = item # 已经知道小于300的就一个\n",
    "    if not signature:\n",
    "        return None\n",
    "    elif 'Hawkeyes Wind' in signature or 'Zhiqiang' in signature: # 只能不断添加规则...\n",
    "        return None\n",
    "    elif '<' in signature:\n",
    "        soup = BeautifulSoup(item, 'html.parser')\n",
    "        signature = soup.get_text()\n",
    "        return signature\n",
    "    else:\n",
    "        return signature\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_content(message_string, name_list):\n",
    "    \n",
    "    content_dict = {}\n",
    "    p = re.compile(ur'\\\"?([\\w\\s\\(\\)]+|[\\x80-\\xff]+)\\\"?\\s<')\n",
    "    msg = mime.from_string(message_string)   \n",
    "    for part in msg.parts:\n",
    "        if not isinstance(part.body, (type(None), str)):\n",
    "            content_dict[str(part)] = part.body\n",
    "    signature = extract_signature(message_string, name_list)\n",
    "    content_dict['Signature'] = signature\n",
    "    return content_dict\n",
    "    # key 未修正, 直接用了带()的值, 附件也未区分, 直接根据content-type有什么添加什么"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "name_list = create_name_list(data_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'(text/html)': u'<html><head></head><body style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; \">Hi friends,<br><br>In light of the upcoming TPAC, I\\'d like to suggest a joint meeting between the two IGs. <br><br>There have been some discussion in the Chinese IG on things related to publishing and I thought it will be nice for us to catch up with the Digital Publishing IG.<br><br>Agenda<br><br>0. Mutual introduction<br>1. CSS3 text (some discussion on our side)<br>2. digital publishing requirement for chinese language (Bobby has written a spec/requirement and it\\'d be nice to know how everyone thinks)<br>3. anything else?<br><br>This discussion won\\'t be an exhaustive one, rather it is to put names to faces, discuss the agendas, and hopefully drive future online discussions.<br><br>If everyone is cool, maybe we can do a 90 minute on Thursday during TPAC? <br><br>---<br>Zi Bin Cheah<br>HTML5 Chinese IG chair<br><br><div apple-content-edited=\"true\"><span class=\"Apple-style-span\" style=\"border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; \"><span class=\"Apple-style-span\" style=\"border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; \"><div style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; \"><span class=\"Apple-style-span\" style=\"border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; \"><div style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; \"><div><div><div><br></div></div></div></div></span></div></span></span></div></body></html>',\n",
       " '(text/plain)': u\"Hi friends,\\n\\nIn light of the upcoming TPAC, I'd like to suggest a joint meeting between the two IGs. \\n\\nThere have been some discussion in the Chinese IG on things related to publishing and I thought it will be nice for us to catch up with the Digital Publishing IG.\\n\\nAgenda\\n\\n0. Mutual introduction\\n1. CSS3 text (some discussion on our side)\\n2. digital publishing requirement for chinese language (Bobby has written a spec/requirement and it'd be nice to know how everyone thinks)\\n3. anything else?\\n\\nThis discussion won't be an exhaustive one, rather it is to put names to faces, discuss the agendas, and hopefully drive future online discussions.\\n\\nIf everyone is cool, maybe we can do a 90 minute on Thursday during TPAC? \\n\\n---\\nZi Bin Cheah\\nHTML5 Chinese IG chair\\n\\n\\n\",\n",
       " 'Signature': 'Zi Bin Cheah\\nHTML5 Chinese IG chair\\n\\n\\n'}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_content(data_list[0], name_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提取实体"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def observe_data(data_list):\n",
    "    for data in data_list:\n",
    "        content_dict = extract_content(data, name_list)\n",
    "        for k, v in content_dict.items():\n",
    "            if k == '(text/plain)':\n",
    "                print v"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def extract_entities(data):\n",
    "    entity_dict = {}\n",
    "    organizations = []\n",
    "    names = []\n",
    "    words = None\n",
    "    for k, v in extract_content(data, name_list).items():\n",
    "        if k == '(text/plain)':\n",
    "            words = pseg.cut(v)\n",
    "            for word, flag in words:\n",
    "                if flag == 'nt':\n",
    "                    organizations.append(word)\n",
    "                elif flag == 'nr':\n",
    "                    names.append(word)\n",
    "    names = list(set(names))\n",
    "    remove_list = [u'於', u'後', u'大大增加', u'索引', u'關於', u'麼', u'安', u'明白', u'连']\n",
    "    names = [name for name in names if name not in remove_list]\n",
    "    entity_dict['Organization'] = organizations\n",
    "    entity_dict['Name'] = list(set(names))\n",
    "    return entity_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Name': [u'\\u5361\\u5217\\u5c3c',\n",
       "  u'\\u5b89\\u5a1c',\n",
       "  u'\\u9b6f\\u8fc5',\n",
       "  u'\\u7b1b\\u5361\\u723e',\n",
       "  u'\\u5927\\u76f8',\n",
       "  u'\\u675c\\u9b6f\\u9580',\n",
       "  u'\\u6625\\u79cb\\u5de6\\u6c0f',\n",
       "  u'\\u8463\\u798f\\u8208',\n",
       "  u'\\u65af\\u5927\\u6797'],\n",
       " 'Organization': []}"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_entities(data_list[4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提取关系"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_relations(data):\n",
    "    relations_dict = {}\n",
    "    msg = mime.from_string(data)\n",
    "    for item in msg.headers.items():\n",
    "        if item[0] == 'In-Reply-To':\n",
    "            relations_dict[item[0]] = item[1]\n",
    "    return relations_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'In-Reply-To': u'<A8DD11E7EBEF4EF0AA731A864157B84F@gmail.com>'}"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_relations(data_list[44])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 生成JSON格式的数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "result = {}\n",
    "with open('W3C.json', 'a') as f:\n",
    "    for data in data_list:\n",
    "        result['headers'] = extract_headers(data)\n",
    "        result['content'] = extract_content(data, name_list)\n",
    "        result['entity'] = extract_entities(data)\n",
    "        result['relation'] = extract_relations(data)        \n",
    "        f.write(json.dumps(result, indent=4, sort_keys=True))\n",
    "        f.write('\\n\\n')\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
